Unsupervised Semantic Action Discovery from Video Collections

Authors

Ozan Sener

Amir Roshan Zamir

Chenxia Wu

Silvio Savarese

Ashutosh Saxena

Abstract

Human communication takes many forms, including speech, text and instructional videos. It typically has an underlying structure, with a starting point, an ending, and certain objective steps between them. In this paper, we consider instructional videos, of which there are tens of millions on the Internet. We propose a method for parsing a video into such semantic steps in an unsupervised way. Our method is capable of providing a semantic “storyline” of the video composed of its objective steps. We accomplish this using both visual and language cues in a joint generative model. Our method can also provide a textual description for each of the identified semantic steps and video segments. We evaluate our method on a large number of complex YouTube videos and show that it discovers semantically correct instructions for a variety of tasks.

O. Sener: Cornell University, Ithaca NY 14853, USA
A.R. Zamir: Stanford University, Stanford CA 94305, USA
C. Wu: Cornell University, Ithaca NY 14853, USA
S. Savarese: Stanford University, Stanford CA 94305, USA
A. Saxena: Brain of Things Inc, Cupertino CA 95014, USA

A first version of this paper appeared in ICCV 2015. This extended version has more details on the learning algorithm and hierarchical clustering with full derivation, additional analysis of the robustness to subtitle noise, and a novel application in robotics.

Similar resources

Action Change Detection in Video Based on HOG

Background and Objectives: Action recognition, the process of labeling an unknown action in a query video, is a challenging problem due to event complexity, variations in imaging conditions, and intra- and inter-individual action variability. A number of solutions have been proposed to solve the action recognition problem. Many of these frameworks suppose that each video sequence includes only one ...

On Linking Heterogeneous Dataset Collections

Link discovery is the problem of linking entities between two or more datasets, based on some (possibly unknown) specification. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n^2) comparisons by clustering entities into blocks, and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods exp...

Unsupervised Alignment of Actions in Video with Text Descriptions

Advances in video technology and data storage have made large scale video data collections of complex activities readily accessible. An increasingly popular approach for automatically inferring the details of a video is to associate the spatiotemporal segments in a video with its natural language descriptions. Most algorithms for connecting natural language with video rely on pre-aligned superv...

Top-down Analysis of Low-level Object Relatedness Leading to Semantic Understanding of Medieval Image Collections

The aim of image understanding, which is a long standing goal of computer vision, is to develop algorithms with which computers can advance to the semantic content of images. One ability of such algorithms would be the automatic discovery of relations between different objects in large collections of images. To analyze this relatedness we present an unsupervised and a semi-supervised approach f...

Extracting Latent Attributes from Video Scenes Using Text as Background Knowledge

We explore the novel task of identifying latent attributes in video scenes, such as the mental states of actors, using only large text collections as background knowledge and minimal information about the videos, such as activity and actor types. We formalize the task and a measure of merit that accounts for the semantic relatedness of mental state terms. We develop and test several largely uns...

Journal: CoRR

Volume: abs/1605.03324

Publication date: 2016
